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We present a new implementation of the Fourier acceleration method for Landau gauge fixing. 
By means of a multigrid inversion we are able to avoid the use of the fast Fourier transform. This 
makes the method more flexible, and well suited for vector and parallel machines. We study the 
performance of this algorithm on serial and on parallel (APEIOO) machines for the 4-dimensional 
SU{2) case. We find that our method is equivalent to the standard implementation of Fourier 
00 \ acceleration already on a serial machine, and that it parallelizes very efficiently: its computational 

0^ ■ cost shows a linear speedup with the number of processors. We have also implemented, on the 

parallel machines, a version of the method using conjugate gradient instead of multigrid. This leads 
to an algorithm that is efficient at intermediate lattice volumes. 
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~~' I. INTRODUCTION 

(N , 

^ , Lattice gauge fixing is a necessary tool for understanding the relationship between continuum and lattice gauge 
theory. In fact, due to asymptotic freedom, the continuum limit of the lattice theory is the weak-coupUng Umit, and a 
weak-coupling expansion requires gauge fixing. Gauge fixing is also used in smearing techniques, and is necessary in 
order to evaluate quark/gluon matrix elements which can be used to extract non-perturbative results from Monte Carlo 
simulations [|l| . It is therefore important to devise numerical algorithms to efficiently gauge fix a lattice configuration. 
I An important issue regarding the efficiency of these algorithms is the problem of critical slowing- down (CSD), which 
0^ occurs when the relaxation time t of an algorithm diverges as the lattice volume is increased M . Conventional local 
algorithms have dynamic critical exponent z « 2, namely r grows with the lattice side N roughly as iV^. Improved 
^ ' local methods show typically z « 1, while global methods may succeed in eliminating CSD completely, i.e. z « 0. 
Usually, global algorithms are more costly per iteration than local methods but, due to the elimination of CSD, their 
total computational cost becomes progressively lower than that of local methods at large lattice volumes. For this 
(-H \ reason, efficient global algorithms are a highly desirable tool in large-volume applications. Another important issue is 
l^' > whether gauge-fixing algorithms can be implemented efficiently on parallel machines |^ , since computers of this type 
are widely used in numerical studies of gauge theory on large lattices. 

A well-known global approach for reducing CSD, applicable to gauge fixing as well as to other problems, is the 
^ ' method of Fourier acceleration (FA) |^ . The idea is to precondition a problem using a diagonal matrix in momentum 
space which is related to the solution of a simplified version of the problem Q,^. For the SU{2) Landau gauge fixing 
case it can be proved |^ that Fourier acceleration eliminates CSD completely at infinite /3, namely the dynamic 
critical exponent z is equal to zero; we have also obtained ||^, at finite /3 and in two dimensions, that z is equal to 
0.036 ±0.064. Moreover, the FA gauge-fixing method is very efficient in achieving a constant value for the longitudinal 
gluon propagator at zero three-momentum |^-^, which provides a very sensitive test of the goodness of the gauge 
fixing. 

To fix Landau gauge on the lattice one looks for a minimum of the functional |9| 
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^t^tsl^^E^r [1 - g{x)U^ix)g^ix + e^)] . (1) 
(We refer to for notation.) The FA update is given by ^("^"'^(x) = R{x) g'^°^'^\x) with 
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Here _F is a Fourier transform, a is a tuning parameter, p^ik) is the square of the lattice momentum, and V ■ A is the 
lattice divergence of the gluon field A^. Thus, in this case, the preconditioning is obtained using in momentum space 
a diagonal matrix with elements given by l/p'^{k) (see Section II] for details). The FA method in its original form 
makes use of the fast Fourier transform (FFT) to evaluate F and F~^, which requires a work of only V log V where V 
is the lattice volume [Q, making it very appealing from the computational point of view. On the other hand, in order 
to implement FFT efficiently, one is restricted to using lattice sides that are powers of 2 0, Chapter 12]. Mor cover, 
implementing FFT on parallel machines of the SIMD type, and especially in 4 dimensions, can be very cumbersome 
1^,0. Here we present a new implementation of the FA method for Landau gauge fixing which avoids completely 
the use of the fast Fourier transform, and we test its performance for the 4-dimensional SU{2) case on serial and on 
parallel machines. Preliminary results have been reported in |12[. 



II. THE MULTIGRID FA METHOD 

Let us start by noting that the Fourier-mode decomposition in eq. (|^) is equivalent to an inversion of the lattice 
Laplacian operator A: 

F-'p-\k)F = i-A)-' . (3) 

(Note that this inversion is carried out for each component of V • A.) Thus, the FFT can be avoided by inverting 
A using an alternative algorithm. Clearly, a good candidate should be suitable for use on parallel machines, and 
should require, ideally, the same computational work as FFT, i.e. VlogV. One such candidate is the multigrid 
(MG) algorithm with W-cycle and piecewise-constant interpolation. The idea of MG is to solve the lattice problem 
recursively, using local (Gauss-Seidel) updates on coarser versions of the original lattice in order to accelerate the 
convergence of slow modes of the solution. MG is an efficient iterative routine for inverting the Laplacian A: with 
our choice of cycle and interpolation, each iteration of the method represents a work of order V, and the number 



of iterations required for convergence is proportional to logV at most [y_3 14|. Moreover, the MG routine can be 
successfully implemented on vector and parallel machines [ p^ . Thus this approach should preserve the property of 
eliminating CSD for Landau gauge fixing, while being applicable on parallel machines. At the same time, there is 
no restriction on the lattice size since, even for a fixed coarsening factor (e.g. 2), the size of the coarsest grid can be 
adjusted conveniently. 

The overhead for the MG routine is likely to be larger than the one for FFT, but in our case it can be reduced by 
exploiting the fact that multigrid (as opposed to FFT) is an iterative method. For example, a significant computational 
gain can be obtained if one starts the inversion from a good initial guess for the solution. Also, by changing the stopping 
criterion for the inversion, the accuracy of the solution can be suitably varied, while for FFT the accuracy is fixed by 
the precision used in the numerical code. This is important since we will be tuning the parameter a in eq. (H), and 
this tuning can be done only up to an accuracy of a few percent. Thus, the inversion of A most likely will not require 
the high accuracy obtained in the FFT case, making possible a substantial reduction of the computational cost. 

In order to test the feasibility of this approach we have started our simulations on a workstation, comparing the 
performance of the MG implementation of the Fourier acceleration method (MGFA) with the original implementation 
using FFT (which we call from now on FFTFA) As a first step, we tuned the parameter a for the FFTFA 

method at infinite f3 and V = 8^. We obtained aopt = 1.105 as optimal choice, and a number of gauge-fixing sweeps 
equal to 14.9(2). (Note that our data represent averages over a set of gauge configurations, and that our error bars 
are one standard deviation.) We then tested MGFA with a = aopt- in addition to the W-cycle (where each grid is 
visited twice before proceeding to the next finer grid), we also tested for comparison the V-cycle (where each grid 
is visited once before proceeding to the next finer grid) and the standard Gauss-Seidcl update; for the W-cycle we 
also varied the number of Gauss-Seidel relaxation sweeps on each grid (Nr), and the minimum number of complete 
multigrid cycles (Nmin)- Results for the number of gauge- fixing sweeps as a function of the accuracy [ p7| , at infinite 
f3 and V = 8"*, are reported in Fig. [l[ By comparing these results with the result from the FFTFA method, it is 
clear that the number of gauge-fixing sweeps is independent of the method used for the inversion of A, provided that 
a high enough accuracy is required. Among the tested versions of the MGFA method, the best from the point of 
view of computational cost is the one with the following choice of parameters: 7 = 2, an accuracy of about 10~^, 
two relaxation sweeps on each grid, and a minimum of two multigrid cycles for each inversion of A. (We note that 
the CPU cost per iteration of this MG version is higher than for the other versions, but the fact that the inversion 
of the Laplacian can be stopped already at an accuracy of 10"^ makes it the fastest version.) We then adopt this 
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version of MG as the routine used for our MGFA method. For the MG routine we have chosen to use a coarsest grid 
of 2^. Nevertheless we have checked that the performance of the chosen version does not change if a coarsest grid of 
4"* is used. This is an important result, as we will see, for the implementation of the algorithm on parallel machines. 
We have also tested different initial guesses for the MG inversion, finding that the use of the solution to the previous 
inversion of the Laplacian is preferable to a null or random initial guess. 

III. RESULTS AND CONCLUSIONS 

Initially, we have tested the performance of the MGFA method at /? = oo and aX (3 = 2.2 on a workstation. The 
results, reported in Table |, are compared with those obtained for the FFTFA method and for the overrelaxation 
(OVE) algorithm [^, an improved local method which shows z « 1 and which is often used for production 
runs in lattice gauge theory. In all cases the stopping criterion for the gauge fixing was (V • A)"^ < 10~^^. From the 
data at /? = oo we can study the volume dependence of the computational cost for the various algorithms. Clearly, 
FFTFA and MGFA have a similar performance showing a number of gauge-fixing sweeps increasing less than 
logarithmically with the volume V, and the CPU time per sweep increasing roughly as V\ogV. From our data we 
note that the number of MG cycles per inversion is essentially independent of the volume. As expected, for the 
overrelaxation algorithm the number of gauge-fixing sweeps is proportional to the lattice side N , and the CPU time 
per sweep grows as the volume TV"'. The data for the total CPU time for the two implementations of the FA method 
and for the overrelaxation method are well fit by VlogV and N^^ respectively. An extrapolation of our data using 
these fits predicts that either version of the FA method would be less costly than the overrelaxation method already 
at lattice side N = 24. Actually, the CPU cost for MGFA scales slightly better with the volume than for FFTFA. 
Thus, on a serial machine, and at (3 — oo, the fast Fourier transform can be successfully replaced by the MG routine, 
and MGFA should be the method of choice in the limit of large lattice volume. MGFA is equivalent to FFTFA also 
at /3 = 2.2 and = 8"*, and therefore it appears that the use of FFT can be avoided also at finite (3. 

We remark that the FA method may have convergence problems at low values of (3 , probably related to the 
large number of local minima of the functional EuWi and/or to the existence of topological objects on the lattice. 
These problems are more likely to affect a global method than a local one (such as overrelaxation). A possible 
solution may be the smearing approach recently introduced in p^ ] : by smoothing out the lattice gauge configuration, 
one can perform gauge fixing at an effective (3 close to infinity; this result is then used as a preconditioning of the 
original gauge-fixing problem, i.e. the gauge-fixing iterations start already close to a minimum. This approach aims 
at eliminating the problem of Gribov copies on the lattice, and is very well suited for the FA method. In fact, the two 
gauge-fixing steps involved (the one for the preconditioning — at high j3 — and the one starting close to a minimum) 
are ideal applications of FA. 

We then applied MGFA on two parallel (APE-Quadrics) machines of the APEIOO series: the Ql (2'^ processors) 
and the QH4 (8'^ processors) In order to implement the method on a parallel machine, the idea is to use as the 
coarsest grid for the MG routine a grid with volume equal to or larger than the number of processors. This avoids 
the problem of idle processors discussed in ref. |l^. For example, for the Ql we implemented the MG routine with 
coarsest grid 4^ (respectively 6'*) for the lattice volume F = 8'' (respectively V = 12''), while for the QH4 we used 
8"^ as the coarsest grid |Q . We have checked that the number of gauge- fixing sweeps does not change if we use for 
the MG routine an accuracy in the range 10~''-10"^. This confirms the result obtained for a serial machine (see Fig. 

In all our runs on APE machines we set the accuracy for the inversion to 5 x 10 ^ . Since these machines work in 
single precision we have also decreased the stopping criterion for the gauge fixing to (V • A)^ < 10^^. 

As an alternative to the MG inversion, we have also implemented a version of the Fourier acceleration method in 
which the Laplacian A is inverted using a standard (iterative) conjugate-gradient (CG) method. We call this method 
CGFA. The CG algorithm is simpler to program, and has been widely used on parallel machines. Its CPU time per 
cycle should be smaller than the one for MG, but the number of iterations required in order to achieve a fixed accuracy 
should increase faster than for the MG routine. In fact, we have checked that the number of multigrid iterations is 
essentially independent of the lattice volume, while the number of CG iterations grows roughly as A^" '^^. 

In Table |l| we report the number of gauge-fixing sweeps obtained, a.t f3 — oo and for several lattice sizes, for 
the MGFA and the CGFA methods on APE machines. Their performances are compared with that of a standard 
overrelaxation (OVE) and with that of an unaccelerated local algorithm (the so-called Los Alamos algorithm, LOS) 
[^,0 . These runs on parallel machines confirm that local algorithms are usually not able to achieve a constant value for 
the longitudinal gluon propagator at zero three-momentum. This can be checked by changing the stopping criterion 
for the gauge fixing: instead of considering (V • A)'^ we can consider the quantity eg defined in which monitors 
the fluctuations of this gluon propagator. Results are reported in Table || for the lattice volume V = 32"*. (Note that 
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for LOS and OVE the results, in this case, have a large statistical error, due to the fact that for some configurations 
the gauge fixing did not converge within 3000 sweeps.) 

Also in the parallel case, the FA method eliminates CSD at /3 = cxo, and therefore should be the method of choice 
on large lattices. With respect to the overrelaxation method, the two implementations of FA are competitive already 
at volume 32^ if one requires a sensitive stopping criterion for the gauge fixing. We observe that, at our lattice sizes, 
the CG implementation is about two times faster than MGFA, but at large volumes we expect MGFA to win out. 

We recall that local methods are more efficiently implemented on parallel machines than global ones, since they 
require smaller communication between processors. Global methods need implementations specifically designed for 
parallel machines in order to achieve a significant reduction of their computational overhead. For example, in a parallel 
implementation, the update for MG is not exactly of the Gauss-Seidel type, and in fact we observe that a higher 
(fixed) number of MG iterations is needed ||2^. We think that our code for the MG routine can be optimized, since 
we have not explored more advanced features of MG that can play a role on parallel machines, such as asynchronous 
multigrid and the use of accommodative cycles instead of a fixed cycling strategy. 

We have checked the dependence of the performance of the algorithms on the number of processors using the data 
from the Ql and from the QH4, respectively for lattice volume 12'* and 32"*. The CPU time per gauge-fixing iteration 
per site scales down by a factor of approximately 62 for all the four algorithms, showing that the two FA methods 
parallelize as efficiently as the local ones. (Note that the number of processors increases by a factor 64 going from the 
Ql to the QH4.) 

We have shown that the fast Fourier transform in the Landau gauge-fixing Fourier acceleration method can be 
successfully substituted — on serial as well as on parallel machines — by an alternative inversion routine. This idea 
can in principle be extended to other applications of FA such as the case of quark propagators, and the Monte Carlo 
method for thermalization of lattice configurations. 
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FIG. 1. Plot of the number of gauge-fixing sweeps for the MGFA algorithm as a function of the accuracy imposed on the 
inversion of the Laplacian. Simulations were carried out on a workstation, for lattice volume 8^ and at P — oo. The different 
types of multigrid cycle employed are: (a) Gauss-Seidel update (O), l/-cycle with A''^ = l,Nmin ~ 1 (0)i W-cyc\e with 
Nr = l,Nmin = 1 (O) , VK-cyclc with Nr = 2, Nmin ~ 1 (*); (b) a closcr view of the W-cyc\e with N,. = 2,Nmin ~ 1 case 
(solid line), together with the ly-cycle with A^,- = 2, Nmin ~ 2 case (x). In the second plot errors are omitted for clarity. We 
note that for FFTFA the number of gauge-fixing sweeps is 14.9 ± 0.2. In all cases we set a = 1.105, and we stopped the gauge 
fixing when the condition (V • A)^ < 10~^° was satisfied. 
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TABLE I. Comparison between the FFTFA and MGFA algorithms at — oo and at /3 = 2.2 on a workstation. The optimal 
choice for the tuning parameter a for the two FA methods is reported llq]. We also quote here results for the overrelaxation 
(OVE) method. 



algorithm 




V 


Ctopt 


GF sweeps 


— 

CPU time (s) 


OVE 


oo 


4 




18.1 ± 0.1 


0.167 ± 0.003 


r r IrA 


oo 


4 


1 1 O 


t n o 1 no 
iO.D ± U.O 


U.o4 ± U.Ui 


TV JV/~^ T71 A 

MGt A 


oo 


4 


1.12 


13.7 ± 0.3 


0.55 ± 0.02 


OVE 


oo 


8 




37.5 ± 0.2 


5.91 ± 0.03 


FFTFA 


oo 


8* 


1.12 


16.1 ±0.3 


9.60 ±0.20 


MGFA 


oo 


8* 


1.12 


16.2 ±0.3 


11.4 ±0.5 


OVE 


oo 


16* 




76.6 ±0.9 


194 ± 2 


FFTFA 


oo 


16* 


1.1 


20.0 ±0.4 


246 ±6 


MGFA 


oo 


16* 


1.1 


20.3 ±0.4 


242 ± 10 


OVE 


2.2 


8* 




139 ±8 


22.5 ± 1.4 


FFTFA 


2.2 


8* 


1.89 


333 ± 27 


200 ± 12 


MGFA 


2.2 


8* 


1.89 


312 ±25 


187 ± 15 



TABLE IL Results for the so-called Los Alamos (LOS) method, the overrelaxation (OVE) method, the CGFA and MGFA 
algorithms at 13 = oo. Runs were carried out on the Ql and QH4 machines, for the lattice volumes 1/ = 8*, 12* and V = 16*, 32* 
respectively. The optimal choice for the tuning parameter a for the two FA methods is also reported. In all cases we stopped 
the gauge fixing when the condition (V ■ A)^ < 10~^ was satisfied. For the second set of data at V = 32* the condition 
eg < 5 X 10~® has been used. 



algorithm 


V 


Oopt 


GF sweeps 


CPU time (s) 


LOS 


8* 




57.3 ± 0.6 


1 


OVE 


8* 




23.0 ± 0.3 


< 1 


CGFA 


8* 


1.14 


13.9 ± 0.1 


4.56 ± 0.07 


MGFA 


8* 


1.10 


13.8 ± 0.1 


13.4 ± 0.2 


LOS 


12* 




117 ± 2 


18.4 ± 0.4 


OVE 


12* 




33.6 ± 0.5 


5.9 ± 0.1 


CGFA 


12* 


1.22 


16.4 ± 0.3 


34.1 ± 0.7 


MGFA 


12* 


1.24 


16.2 ± 1.0 


92 ± 2 


LOS 


16* 




198 ± 2 


« 1 


OVE 


16* 




46.3 ± 0.6 


< 1 


CGFA 


16* 


1.33 


17.6 ± 0.2 


2.12 ± 0.04 


MGFA 


16* 


1.26 


17.3 ± 0.1 


5.95 ± 0.08 


LOS 


32* 




640 ± 20 


83 ± 2 


OVE 


32* 




84 ± 2 


12.0 ± 0.3 


CGFA 


32* 


1.38 


21.5 ± 0.7 


49 ± 1 


MGFA 


32* 


1.35 


20.7 ± 2.0 


98 ± 3 


LOS 


32* 




1970 ± 90 


553 ± 30 


OVE 


32* 




252 ± 70 


75 ± 20 


CGFA 


32* 


1.34 


23.4 ± 0.8 


57 ± 2 


MGFA 


32* 


1.33 


22 ± 3 


123 ± 3 
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